{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction to Sampling and Hypothesis Testing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Random Variables | Examples in R"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Select this cell and type Ctrl-Enter to execute the code below.\n",
    "\n",
    "install.packages(\"statip\")\n",
    "library(statip)\n",
    "\n",
    "set_plot_dimensions <- function(width_choice, height_choice) {\n",
    "    options(repr.plot.width=width_choice, repr.plot.height=height_choice)\n",
    "}\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Bernoulli Distribution\n",
    "\n",
    "The simplest discrete probability distribution is the **Bernoulli distribution**: \n",
    "\n",
    "$$B \\sim \\text{Bernoulli}(p)$$\n",
    "\n",
    "This describes a situation where there are only two possible outcomes, labelled \"success\" ($B=1$) and \"failure\" ($B=0$).\n",
    "\n",
    "The probability of obtaining a success is a constant, $p$.\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    " \\mathbb{P}(B = x) &= \\begin{cases}\n",
    "  p & \\text{for $x=1$}\\\\\n",
    "  1-p & \\text{for $x=0$}\n",
    "  \\end{cases}\n",
    "\\\\\n",
    "\\\\\n",
    "\\mathbb{E}B &= 1 \\cdot p + 0 \\cdot (1-p) = p\n",
    "\\\\\n",
    "\\\\\n",
    "\\text{Var}B &= \\mathbb{E}(B-p)^2 = (1-p)^2 \\cdot p + (0-p)^2 \\cdot (1-p) = p(1-p)\n",
    "\\end{align*}\n",
    "$$\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Example: rolling a six with one die\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "p <- 1/6\n",
    "x <- 0:1\n",
    "pmf <- dbern(x,p)  # a Bernoulli distribution with p=1/6"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the probability mass function\n",
    "\n",
    "set_plot_dimensions(4, 4)\n",
    "plot(x, pmf, ylim=c(0,1), xlab=\"x\", type=\"h\", col=\"red\", axes=FALSE)\n",
    "points(x, pmf, col=\"red\")\n",
    "axis(side=1, at=c(0,1))\n",
    "axis(side=2)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the expected value = p\n",
    "\n",
    "expectation <- sum(x * pmf)\n",
    "expectation\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the variance = p * (1-p)\n",
    "\n",
    "variance <- sum( (x - expectation)^2 * pmf)\n",
    "variance\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Binomial Distribution\n",
    "\n",
    "If $X$ is the number of successes in $n$ *independent and identically distributed* (i.i.d.) Bernoulli trials, with probability of success $p$, then $X$ is said to follow a **binomial distribution**: \n",
    "\n",
    "$$X = B_{1} + ... + B_{n} \\sim \\text{binom}(n,p)$$\n",
    "\n",
    "The probability of obtaining $x$ successes is given by\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "  \\mathbb{P}(X = x) &= \\binom{n}{x}p^{x}(1-p)^{n-x}.\n",
    "\\\\\n",
    "\\\\\n",
    "\\mathbb{E}X &= \\mathbb{E}( B_{1} + \\cdots + B_{n} ) = \\mathbb{E}B_{1} + \\cdots + \\mathbb{E}B_{n} = np\n",
    "\\\\\n",
    "\\\\\n",
    "\\text{Var}X &= \\text{Var}( B_{1} + \\cdots + B_{n} ) = \\text{Var}B_{1} + \\cdots + \\text{Var}B_{n} = np(1-p)\n",
    "\\end{align*}\n",
    "$$\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Example: number of sixes obtained when rolling ten dice\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n <- 10\n",
    "p <- 1/6\n",
    "x <- 0:n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the probability mass function\n",
    "\n",
    "pmf <- dbinom(x,n,p)  # a binomial distribution with n=10, p=1/6\n",
    "\n",
    "set_plot_dimensions(5, 4)\n",
    "plot(x, pmf, ylim=c(0,0.4), xlab=\"x\", type=\"h\", col=\"red\", axes=FALSE)\n",
    "points(x, pmf, col=\"red\")\n",
    "axis(side=1, at=0:10)\n",
    "axis(side=2)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the cumulative distribution function\n",
    "\n",
    "cdf <- pbinom(x,n,p)\n",
    "\n",
    "set_plot_dimensions(5, 4)\n",
    "plot(x, cdf, ylim=c(0,1), xlab=\"x\", type=\"s\", col=\"blue\", axes=FALSE)\n",
    "axis(side=1, at=x)\n",
    "axis(side=2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the expected value = n * p\n",
    "\n",
    "expectation <- sum(x * pmf)\n",
    "expectation\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the variance = n * p * (1-p)\n",
    "\n",
    "variance <- sum( (x - expectation)^2 * pmf)\n",
    "variance\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Calculate the probability of rolling one or more sixes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "1 - dbinom(0,n,p)  # using the PMF at x=0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Poisson Distribution"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The **Poisson distribution** describes the number of observations of an event that is randomly distributed in space or time.\n",
    "\n",
    "$$X \\sim \\text{Poisson}(\\lambda)$$\n",
    "\n",
    "e.g., number of radioactive decays in a second, number of accidents in a year, number of mutations on a chromosome.\n",
    "\n",
    "The probability of observing $x$ events is given by\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "  \\mathbb{P}(X = x) &= \\frac{e^{-\\lambda}\\lambda^{x}}{x!} \\text{ for } x=0,1,2,...\n",
    "\\\\\n",
    "\\\\\n",
    "\\mathbb{E}X &= \\lambda\n",
    "\\\\\n",
    "\\\\\n",
    "\\text{Var}X &= \\lambda\n",
    "\\end{align*}\n",
    "$$\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lambda = 4 # e.g. an average of 4 meteorite impacts per year."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the probability mass function\n",
    "\n",
    "x = 0:16\n",
    "pmf <- dpois(x,lambda)  # a Poisson distribution with lambda=4\n",
    "\n",
    "set_plot_dimensions(5, 4)\n",
    "plot(x, pmf, ylim=c(0,0.25), xlab=\"x\", type=\"h\", col=\"red\", axes=FALSE)\n",
    "points(x, pmf, col=\"red\")\n",
    "axis(side=1, at=x)\n",
    "axis(side=2)\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the cumulative distribution function\n",
    "\n",
    "cdf <- ppois(x,lambda)\n",
    "\n",
    "set_plot_dimensions(5, 4)\n",
    "plot(x, cdf, ylim=c(0,1), xlab=\"x\", type=\"s\", col=\"blue\", axes=FALSE)\n",
    "axis(side=1, at=x)\n",
    "axis(side=2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the expected value = lambda\n",
    "\n",
    "expectation <- sum(x * pmf)  # approximates the sum for x->infinity\n",
    "expectation\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the variance = lambda\n",
    "\n",
    "variance <- sum( (x - expectation)^2 * pmf) # approximates the sum for x->infinity\n",
    "variance\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What is the probability of observing between 2 and 4 meteorite impacts in a given year?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ppois(4,lambda) - ppois(1,lambda) # using the CDF"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Uniform Distribution"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The **uniform distribution** describes a continuous random variable with a flat pdf over a specified interval.\n",
    "\n",
    "$$X \\sim U(a,b)$$\n",
    "\n",
    "e.g. angle of a spinner, where $a=0$ and $b=360$.\n",
    "\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "  f(x) &= \\frac{1}{b-a} \\text{ for } a \\le x \\le b\n",
    "\\\\\n",
    "\\\\\n",
    "\\mathbb{E}X &= \\frac{1}{2}(a+b)\n",
    "\\\\\n",
    "\\\\\n",
    "\\text{Var}X &= \\frac{1}{12}(b-a)^2\n",
    "\\end{align*}\n",
    "$$\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# e.g. angle of a spinner.\n",
    "a <- 0\n",
    "b <- 360"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the probability density function\n",
    "\n",
    "wid <- 0.001\n",
    "x <- seq(-90,450,wid)\n",
    "pdf <- dunif(x,a,b)  # a uniform distribution between 0 and 360\n",
    "\n",
    "set_plot_dimensions(5, 4)\n",
    "plot(x, pdf, xlab=\"x\", ylim=c(0,0.004), type=\"l\", col=\"red\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the cumulative distribution function\n",
    "\n",
    "cdf <- punif(x,a,b)  # a uniform distribution between 0 and 360\n",
    "\n",
    "set_plot_dimensions(5, 4)\n",
    "plot(x, cdf, xlab=\"x\", ylim=c(0,1), type=\"l\", col=\"blue\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the expected value = 0.5 * (a + b)\n",
    "\n",
    "expectation <- sum(x * pdf * wid)  # approximates the integral of (x * pdf(x))\n",
    "expectation\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the variance = 1/12 * (b - a)^2\n",
    "\n",
    "# approximating the integral of [(x - expectation)^2 * pdf] dx\n",
    "variance <- sum( (x - expectation)^2 * pdf * wid)\n",
    "variance\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What is the probability of spinning an angle between 90 and 180 degrees?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "punif(180,a,b) - punif(90,a,b)  # using the CDF"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exponential Distribution"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The **exponential distribution** describes waiting times between Poisson events.\n",
    "\n",
    "$$X \\sim \\text{exp}(\\lambda)$$\n",
    "\n",
    "e.g. time until a single U-238 atom decays.\n",
    "\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "  f(x) &= \\lambda e^{-\\lambda x}\n",
    "\\\\\n",
    "\\\\\n",
    "\\mathbb{E}X &= \\frac{1}{\\lambda}\n",
    "\\\\\n",
    "\\\\\n",
    "\\text{Var}X &= \\frac{1}{\\lambda^2}\n",
    "\\end{align*}\n",
    "$$\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# X describes the time until the first meteorite impact, in years.\n",
    "\n",
    "lambda <- 4  # e.g. an average of 4 meteorite impacts per year."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# plot the probability density function\n",
    "\n",
    "wid <- 0.001\n",
    "x <- seq(0,10,wid)\n",
    "pdf <- dexp(x,lambda)  # an exponential distribution with rate = 4\n",
    "\n",
    "set_plot_dimensions(5, 4)\n",
    "plot(x, pdf, xlab=\"x\", xlim=c(0,1.5), type=\"l\", col=\"red\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the cumulative distribution function\n",
    "\n",
    "cdf <- pexp(x,lambda)\n",
    "\n",
    "set_plot_dimensions(5, 4)\n",
    "plot(x, cdf, xlab=\"x\", xlim=c(0,1.5), ylim=c(0,1), type=\"l\", col=\"blue\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the expected value = 1/lambda\n",
    "\n",
    "expectation <- sum(x * pdf * wid)  # approximates the integral of [x * pdf(x)] dx\n",
    "expectation\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the variance = (1/lambda)^2\n",
    "\n",
    "# approximating the integral of [(x - expectation)^2 * pdf] dx\n",
    "variance <- sum( (x - expectation)^2 * pdf * wid)  \n",
    "variance"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Find the probability of observing a meteorite impact during the first half of the year."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pexp(0.5,lambda)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Normal Distribution"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The **normal distribution** (also known as the Gaussian distribution) describes many situations associated with measurement. Its parameters are the *mean*, $\\mu$, and the *variance*, $\\sigma^2$:\n",
    "\n",
    "$$X \\sim N(\\mu,\\sigma^2)$$\n",
    "\n",
    "e.g. measured thickness of a piece of paper\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "  f(x) &= \\frac{1}{\\sqrt{2\\pi\\sigma^2}}e^{\\frac{(x-\\mu)^2}{2\\sigma^2}}\n",
    "\\\\\n",
    "\\\\\n",
    "\\mathbb{E}X &= \\mu\n",
    "\\\\\n",
    "\\\\\n",
    "\\text{Var}X &= \\sigma^2\n",
    "\\end{align*}\n",
    "$$\n",
    "\n",
    "The normal distribution can be used as an approximation to the binomial (for large $n$) and the Poisson (for large $\\lambda$).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# X represents paper thickness in microns\n",
    "mu <- 200\n",
    "sigma <- 20"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the probability density function\n",
    "\n",
    "wid <- 0.001\n",
    "x <- seq(100,300,wid)\n",
    "pdf <- dnorm(x,mu,sigma)  # a normal distribution\n",
    "\n",
    "set_plot_dimensions(5, 4)\n",
    "plot(x, pdf, xlab=\"x\", type=\"l\", col=\"red\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# plot the cumulative distribution function\n",
    "\n",
    "cdf <- pnorm(x,mu,sigma)\n",
    "\n",
    "set_plot_dimensions(5, 4)\n",
    "plot(x, cdf, xlab=\"x\", ylim=c(0,1), type=\"l\", col=\"blue\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the expected value = mu\n",
    "\n",
    "expectation <- sum(x * pdf * wid)  # approximates the integral of [x * pdf(x)] dx\n",
    "expectation\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the variance = (sigma)^2\n",
    "\n",
    "# approximating the integral of [(x - expectation)^2 * pdf] dx\n",
    "variance <- sum( (x - expectation)^2 * pdf * wid)  \n",
    "variance"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What proportion of measurements are expected to be over 225 $\\mu m$?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "1 - pnorm(225,mu,sigma) # using the CDF"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Log-normal distribution"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Many processes in biology, chemistry and the social sciences lead to variables that have **log-normal distributions**, that is, $\\log{X}$ follows a normal distribution."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
